Gaudi: add CI #3160


Draft: wants to merge 16 commits into main

Conversation

@baptistecolle (Collaborator) commented Apr 10, 2025

What does this PR do?

This PR adds CI support for the Gaudi backend. It includes an integration test that starts the model "meta-llama/Llama-3.1-8B-Instruct", performs a few requests, and verifies that the outputs match the expected results.

Additional models are also supported, but running tests for all of them is quite slow, so they are not included in the CI by default. However, instructions on how to run the integration tests for all supported models have been added to the Gaudi backend README.
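The test flow described above (start the server, send a few requests, compare against reference outputs) can be sketched roughly as follows. This is an illustrative sketch, not the actual test code from this PR: the helper names and expected-output handling are assumptions, while the `/generate` payload shape follows TGI's HTTP API with greedy decoding so outputs are deterministic.

```python
# Illustrative sketch of the integration-test flow: build a greedy
# /generate request and compare the response against a reference output.
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str, max_new_tokens: int = 32) -> urllib.request.Request:
    """Build a /generate request with sampling disabled for determinism."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": False},
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


def outputs_match(response_body: bytes, expected: str) -> bool:
    """Compare the generated text against the expected reference output."""
    return json.loads(response_body)["generated_text"] == expected


# Example usage against a running server (e.g. one launched with
# --model-id meta-llama/Llama-3.1-8B-Instruct):
#   with urllib.request.urlopen(build_generate_request("http://localhost:8080", prompt)) as r:
#       assert outputs_match(r.read(), expected_text)
```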

@baptistecolle (Collaborator, Author) commented Apr 22, 2025

I’ll wait for the Gaudi integration test CI to pass before merging anything:
https://github.com/huggingface/text-generation-inference/actions/runs/14591230970/job/40927197928?pr=3160

The previous run was green, which gives me confidence in the current changes:
https://github.com/huggingface/text-generation-inference/actions/runs/14384130453/job/40336095297

Unfortunately, it can take days to get assigned a Gaudi1 runner 😭, so I figured I could start iterating on your reviews in the meantime rather than waiting for the CI to finish before requesting feedback. In any case, I will only merge once the Gaudi integration test passes in CI.

@baptistecolle baptistecolle marked this pull request as ready for review April 22, 2025 10:01
@regisss (Collaborator) left a comment

LGTM!

We should soon have access to Gaudi2 and Gaudi3 ephemeral runners on demand, which will make things much easier than waiting for a DL1 instance. I suggest we wait for those to be available before updating and merging this PR.

@baptistecolle (Collaborator, Author) commented
OK, I will wait for the new runners before adding Gaudi to the CI, as the DL1 runners are indeed very unreliable.

@baptistecolle baptistecolle marked this pull request as draft April 23, 2025 07:42
@Narsil (Collaborator) previously approved these changes Apr 23, 2025

LGTM

@baptistecolle (Collaborator, Author) commented
The runners for Gaudi are ready! 🙌 Thanks @regisss

Just requesting new reviews to make sure everything is still okay. Since the last review, I rebased on main and switched to the new runners. The integration tests are now passing and the runners are super fast! https://github.com/huggingface/text-generation-inference/actions/runs/15160963395/job/42627380206?pr=3160

@baptistecolle baptistecolle marked this pull request as ready for review May 21, 2025 11:49
@@ -129,9 +129,9 @@ jobs:
         export label_extension="-gaudi"
         export docker_volume="/mnt/cache"
         export docker_devices=""
-        export runs_on="ubuntu-latest"
+        export runs_on="itac-bm-emr-gaudi3-dell-1gaudi"
Review comment from a collaborator:
All tests are going to pass with 1 device only? Big (i.e. 70B+ parameters) models are not tested?

@baptistecolle (Collaborator, Author) replied May 22, 2025
Indeed, I disabled the big models and only kept a small model for faster iteration. I just re-enabled a multi-card test and it is broken 😬. There seems to be a regression between the original PR and the latest TGI backend, so I am looking into it 👀. The error also differs depending on the hardware (Gaudi 1 vs Gaudi 3) 😣

@regisss (Collaborator) commented May 22, 2025

@baptistecolle A couple of questions:

  • It's not possible to select a specific runner for each test config right?
  • If I want to add a new model to test, I just need to add a new test config in test_gaudi_generate.py?

@baptistecolle (Collaborator, Author) commented May 22, 2025

> @baptistecolle A couple of questions:
>
>   • It's not possible to select a specific runner for each test config right?
>   • If I want to add a new model to test, I just need to add a new test config in test_gaudi_generate.py?

  1. No, it is not. I think this would require some rework of the build workflow, which is shared across all the hardware targets. The best alternative would be to use a runner with 8 cards and then set HABANA_VISIBLE_DEVICES=1.
  2. Yes, that's correct.
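The single-card setup mentioned in the first answer could look roughly like this. It is a sketch under stated assumptions: the image tag, port mapping, and device index are placeholders, and the actual launch command used by the CI may differ.

```shell
# Restrict the visible Gaudi cards so several test configs can share one
# multi-card runner (device index 0 is an illustrative choice).
export HABANA_VISIBLE_DEVICES=0

# Hypothetical launch command (image tag and ports are placeholders):
#   docker run --runtime=habana -e HABANA_VISIBLE_DEVICES \
#     -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest-gaudi \
#     --model-id meta-llama/Llama-3.1-8B-Instruct

echo "HABANA_VISIBLE_DEVICES=$HABANA_VISIBLE_DEVICES"
```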

One additional useful remark: you also need to add the new config with "run_by_default": True for it to run in the CI. Since there are a lot of tests, I only run a subset of them in the CI for faster testing, rather than every model we support.
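A config entry along these lines could express the subset selection described above. Only the "run_by_default" key comes from this discussion; the dict layout, the other field names, and the helper function are illustrative assumptions, not the actual contents of test_gaudi_generate.py.

```python
# Hypothetical sketch of registering a model test config and selecting
# the subset that runs in CI. Only "run_by_default" comes from the PR
# discussion; everything else here is an assumed layout.
TEST_CONFIGS = {
    "llama3-8b-1card": {
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",
        "num_cards": 1,
        "run_by_default": True,  # included in the default CI subset
    },
}


def configs_to_run(configs: dict, run_all: bool = False) -> list[str]:
    """Return the config names to execute: everything when run_all is set
    (e.g. for a full local run), otherwise only the default CI subset."""
    return [
        name
        for name, cfg in configs.items()
        if run_all or cfg.get("run_by_default", False)
    ]
```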

@baptistecolle baptistecolle marked this pull request as draft May 22, 2025 07:24
4 participants